Exploratory Visualization on Ford GoBike Service Data

in the SF Bay Area (February 2019)

by Yasser Gharib

Investigation Overview

Bike-sharing services like "Ford GoBike" are among the fastest-growing transport services in the world and have gained popularity in major cities across the globe. They allow people in metropolitan areas to rent bicycles for short trips, usually within 30 minutes. Ford GoBike has collected a rich set of data on its bicycle-sharing service through its electronic system. Each dataset contains information about individual rides made in a bike-share system covering a city over a certain period.

In this project, an exploratory visualization analysis is performed in Python on the "Ford GoBike" dataset to figure out the relationship between rider features and trip features, such as when (time periods), where (locations), and why most trips are taken.

Python visualization techniques are used to figure out which variables have the most influence on the bike-sharing service.

Dataset Overview

The project dataset is provided by the Ford GoBike sharing service in the greater San Francisco Bay Area for ONE month (February 2019) and covers thousands of bikes and trips. It's the most fun, convenient, and affordable way to explore the region.

Setup Environment

Initialization

Set visualization style

Read data
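The notebook's code cells are not reproduced in this text. As a minimal sketch of the read step, assuming pandas and a CSV export of the trip data: the helper name `read_trips`, the file name mentioned in the comment, and the three sample rows are all placeholders for illustration, not the real data.

```python
import io

import pandas as pd

# Made-up rows standing in for the real file. In the notebook the source would
# be the downloaded CSV (a name such as "201902-fordgobike-tripdata.csv" is an
# assumption here, not taken from the text).
SAMPLE_CSV = """duration_sec,start_time,end_time,start_station_id,user_type,member_birth_year,member_gender
600,2019-02-01 08:15:00,2019-02-01 08:25:00,21.0,Subscriber,1985.0,Male
1500,2019-02-02 17:40:30,2019-02-02 18:05:30,67.0,Customer,,
90,2019-02-03 23:05:10,2019-02-03 23:06:40,15.0,Subscriber,1970.0,Other
"""


def read_trips(source):
    """Read trip records from a path or file-like object into a DataFrame."""
    return pd.read_csv(source)


trips = read_trips(io.StringIO(SAMPLE_CSV))
print(trips.shape)         # (rows, columns)
print(trips.isna().sum())  # missing values per column
```

On the real file, `trips.shape` is where the 183,412 trips and 16 features quoted in the next section would show up.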

Dataset Structure

This data set includes information about individual rides made in a bikeshare system covering the greater San Francisco Bay area for ONE month (February 2019) with 183,412 trips and 16 features.

Examine DataFrame

The 16 Features:

  • duration_sec: The trip duration, given in seconds. Minutes would be a more natural unit for analysis.
  • start_time, end_time: The time variables cover one month (February 2019). They are stored as strings, so for the analysis they need to be converted to datetime format and broken down into time of day and day of the week. We'll use only start_time and the duration (end_time is only needed to calculate the duration, which is already available).
  • start_station_id, end_station_id: float64; they give the IDs of the start and end stations for each trip.
  • start_station_name, end_station_name: they give the names of the start and end stations for each trip.
  • (start_station_latitude, start_station_longitude), (end_station_latitude, end_station_longitude): coordinates for placing the start and end stations on a map (GIS, Google Maps) or for calculating the straight-line distance between them (we will not use them).
  • bike_id: int64; an ID number telling which bike was used (the total duration per bike could be used for a maintenance schedule).
  • user_type: The data uses 'Subscriber' and 'Customer'.
  • member_birth_year: float64. The dataset provides the member's birth year, so ages can be derived by subtracting the birth year from the year of the dataset, 2019.
  • member_gender: Male, Female, or Other.
  • bike_share_for_all_trip: yes/no flag indicating whether the trip was made under the Bike Share for All program (the discounted membership for low-income residents).
  • The features fall into three groups. Trip features: start_time, end_time, duration_sec, start_station_name, end_station_name, start_station_latitude, start_station_longitude, end_station_latitude, end_station_longitude, and which bike was used (bike_id)

    Bike features: bike_id, bike_share_for_all_trip

    User features: gender (member_gender), age (member_birth_year), user_type

    Which dataset features are of most interest for the investigation?

    In this dataset, the features of most interest are the trip features (start time/location, end time/location, and duration) together with rider characteristics (age, gender, and user_type), which help answer when, where, and why most trips are taken.

    Data Cleansing

    Updating data types
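The conversions described in the feature list could be sketched as follows. This is an illustration on made-up rows, not the notebook's own cell; the new column names `duration_min` and `member_age` are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse the dataset's column names.
trips = pd.DataFrame({
    "duration_sec": [600, 1500, 90],
    "start_time": ["2019-02-01 08:15:00", "2019-02-02 17:40:30", "2019-02-03 23:05:10"],
    "member_birth_year": [1985.0, np.nan, 1970.0],
    "user_type": ["Subscriber", "Customer", "Subscriber"],
})

# Seconds to minutes, a more natural unit for short trips.
trips["duration_min"] = trips["duration_sec"] / 60

# Strings to datetime so that hour and weekday can be extracted later.
trips["start_time"] = pd.to_datetime(trips["start_time"])

# Age = dataset year minus birth year. The nullable "Int64" dtype keeps
# missing birth years as <NA> instead of failing on NaN.
trips["member_age"] = (2019 - trips["member_birth_year"]).astype("Int64")

# A two-level label is a natural fit for a categorical dtype.
trips["user_type"] = trips["user_type"].astype("category")

print(trips.dtypes)
```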

    Checking for missing values and duplicates

    Observation: Missing values were found in 6 features:

    member_gender, member_birth_year, start_station_id, start_station_name, end_station_id, end_station_name.

    Check if duplicates exist:

    Observation: No duplicates exist

    Drop rows with missing data, since they make up a small share of the dataset (4.61%)
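A minimal sketch of the checks above, run on a made-up five-row frame (the values are illustrative only): count missing values per column, measure the share of rows affected, count duplicates, and drop the incomplete rows.

```python
import numpy as np
import pandas as pd

# Made-up sample with a few gaps, reusing the dataset's column names.
trips = pd.DataFrame({
    "duration_sec": [600, 1500, 90, 420, 310],
    "start_station_name": ["A St", None, "B St", "A St", "C St"],
    "member_gender": ["Male", "Female", None, "Other", "Female"],
    "member_birth_year": [1985.0, 1990.0, np.nan, 1978.0, 1995.0],
})

# Missing values per column.
missing_per_column = trips.isna().sum()

# Percentage of rows that have at least one missing value.
missing_share = trips.isna().any(axis=1).mean() * 100

# Number of fully duplicated rows.
n_duplicates = trips.duplicated().sum()

# Drop incomplete rows and renumber the index.
clean = trips.dropna().reset_index(drop=True)

print(missing_per_column)
print(f"{missing_share:.2f}% of rows have missing data; {n_duplicates} duplicates")
```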

    For any skewed distribution, a transformation toward a normal-like distribution will be applied.

    Derived Data

    Generate new field for day period (time period) from start_time

    Generate new field for Week Day from start_time

    Set the display order of time period and weekday for visualization
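The derived fields could be built roughly as below. The hour boundaries for the three periods are an assumption (the text does not state them), and the column names `time_of_day` and `weekday` are illustrative.

```python
import pandas as pd

# Made-up start times (1 February 2019 was a Friday).
trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-02-01 08:15:00",  # Friday
        "2019-02-02 13:40:30",  # Saturday
        "2019-02-03 21:05:10",  # Sunday
        "2019-02-04 17:59:59",  # Monday
    ])
})

PERIODS = ["Morning", "Afternoon", "Night"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Assumed boundaries: [0, 12) morning, [12, 18) afternoon, [18, 24) night.
trips["time_of_day"] = pd.cut(
    trips["start_time"].dt.hour,
    bins=[0, 12, 18, 24],
    labels=PERIODS,
    right=False,  # left-closed bins, so hour 12 counts as afternoon
)

# An ordered categorical makes plots and sorts follow calendar order
# instead of alphabetical order.
trips["weekday"] = pd.Categorical(
    trips["start_time"].dt.day_name(), categories=WEEKDAYS, ordered=True
)

print(trips.sort_values("weekday"))
```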

    Further fields needed for specific visualizations will be generated where they are used.

    Filter data to include reasonable member age range

    There are outliers. Ages from 18 to 55 account for 95% of the users, so it is reasonable to remove users older than 60. Some users were recorded as more than 100 years old.

    The median age of Ford GoBike users is around 33 to 34.
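As an illustration of the age filter, here is a sketch on made-up birth years. The cutoff of 60 follows the text; the names `MAX_AGE` and `member_age` are assumptions.

```python
import pandas as pd

# Made-up birth years, including one implausible value (1900).
trips = pd.DataFrame(
    {"member_birth_year": [1985, 1990, 1900, 1955, 1996, 2000, 1978, 1988]}
)
trips["member_age"] = 2019 - trips["member_birth_year"]

MAX_AGE = 60  # cutoff taken from the text above

# Share of rows that survive the filter, then the filter itself.
share_kept = (trips["member_age"] <= MAX_AGE).mean()
filtered = trips[trips["member_age"] <= MAX_AGE]

median_age = filtered["member_age"].median()
print(f"kept {share_kept:.0%} of rows, median age {median_age}")
```

Checking `share_kept` before filtering is a cheap way to confirm that the cutoff removes only a small tail of the data.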

    Is it better to work with the busiest stations or with all stations?

    Let's check.

    Start stations: there are 329, with varying traffic.

    Let's visualize them.

    What about end stations?

    End stations: there are 329, with varying traffic.

    Let's visualize them.

    From the two charts above, 7 start and end stations have the highest trip counts and stand apart from the other stations in San Francisco.

    The busiest stations (the first 7 on the graph) will be used in this investigation.

    Let's select and check these stations.

    Subset the dataset by keeping only the top 7 locations by trip count
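One way to do this subsetting is with `value_counts` and `isin`. The sketch below uses synthetic station names and counts, so only the mechanics carry over to the real data.

```python
import pandas as pd

# Synthetic trips: "Station 1" has 9 trips, "Station 2" has 8, ..., "Station 9" has 1.
names = [f"Station {i}" for i in range(1, 10) for _ in range(10 - i)]
trips = pd.DataFrame({"start_station_name": names})

TOP_N = 7

# value_counts sorts in descending order, so head() yields the busiest stations.
station_counts = trips["start_station_name"].value_counts()
top_stations = station_counts.head(TOP_N).index

# Keep only the trips that start at one of those stations.
top_trips = trips[trips["start_station_name"].isin(top_stations)]

print(station_counts.head(TOP_N))
print(len(top_trips), "of", len(trips), "trips kept")
```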

    Where and why are most trips taken?

    On inspection, the top 7 start and end stations in San Francisco likely see the most trips because they are connected to public transportation, such as Caltrain, Metro (Berry) stations, the Ferry Building, and Market Street.

    The top 7 start stations and the top 7 end stations are almost the same,

    so the investigation will focus on the top 7 start stations, the busiest locations, each with over 2,500 trips:

  • Market St at 10th St

  • San Francisco Caltrain Station 2 (Townsend St at 4th St)

  • Berry St at 4th St

  • Montgomery St BART Station (Market St at 2nd St)

  • Powell St BART Station (Market St at 4th St)

  • San Francisco Ferry Building (Harry Bridges Plaza)

  • San Francisco Caltrain (Townsend St at 4th St)

    Visualizing Data

    Univariate Exploration

    When are most trips taken?

    We look at the start time and start location in this dataset.

    Time:

    The distributions of time of day and weekday, regenerated after subsetting:

    For these top 7 stations, based on the figures above, we found:

  • During the day, there are more trips in the morning and afternoon than at night, probably because of rush hours. The number of trips in the afternoon is slightly lower than in the morning and higher than at night. Riders may go out in the morning and come back home in the afternoon, and may not ride back at night.

  • It makes sense that there are more trips on weekdays and fewer on weekends because of work schedules.

    The Distribution of Trip Durations

    There is a long tail in the distribution, so let's put it on a log scale instead.

    From the figure, most trip durations cluster around 600 seconds (10 minutes). On the log scale, the distribution looks roughly normal.
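A log-scale histogram of this kind can be sketched with matplotlib as follows. The durations are synthetic (drawn from a log-normal distribution centred near 600 s), so the plot only imitates the shape described above, and the bin range is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic durations, not the real data.
duration_sec = rng.lognormal(mean=np.log(600), sigma=0.6, size=1000)

# Bin edges evenly spaced in log10 space (10 bins per decade, from 10 s to
# 100,000 s), so that bars have equal width on a logarithmic axis.
bin_edges = np.logspace(1, 5, 41)

fig, ax = plt.subplots(figsize=(8, 4))
counts, edges, _ = ax.hist(duration_sec, bins=bin_edges)
ax.set_xscale("log")
ax.set_xlabel("Trip duration (seconds, log scale)")
ax.set_ylabel("Number of trips")

# Left edge of the tallest bin, as a rough location of the peak.
peak_left_edge = edges[np.argmax(counts)]
print(f"tallest bin starts at about {peak_left_edge:.0f} s")
```

Defining the bins in log space, rather than only switching the axis scale, avoids bars that look progressively narrower toward the right.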

    The relationship between user features and the features of the most-taken trips:

    User characteristics:

    The distributions of user type and gender

    For these top 7 stations, based on the figures above, we found:

  • From the earlier notes on time, it makes sense that users ride every day and that the busiest stations are near public transportation, so there are more subscribers than customers, because subscribing costs less.

  • For gender, the number of trips by male users is about 4 times the number of trips by female users.

  • There are few users with 'Other' gender. It is not clear whether these users were unwilling to reveal their gender or whether there are data entry issues; we will keep them in the dataset.

    Age: based on the distribution

    It is right-skewed.

    From the figure, most users are around 30 years old. Although some users older than 90 look like high outliers, we will keep them in the dataset.

    Transformations

    The variables age and duration_sec have different kinds of skew, so a log transformation is used to bring them closer to a normal distribution:

  • The age data has one big peak between 25 and 40 years old and some smaller peaks.

  • The duration data has one peak between 550 and 650 seconds.

    Bivariate Exploration

    Based on the correlation, age is slightly negatively correlated with duration. In this dataset, most users are between 30 and 40 years old, and there are fewer samples in the older population.
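For reference, a Pearson correlation between age and duration can be computed as in the sketch below. The data is synthetic, with a weak negative dependence built in on purpose, so the coefficient it prints says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic ages and durations; the "- 4 * age" term deliberately builds in
# a weak negative relationship for illustration.
age = rng.integers(18, 61, size=500)
duration_sec = 900 - 4 * age + rng.normal(0, 150, size=500)

trips = pd.DataFrame({"member_age": age, "duration_sec": duration_sec})

# Series.corr uses the Pearson coefficient by default.
corr = trips["member_age"].corr(trips["duration_sec"])
print(f"Pearson r = {corr:.2f}")
```

A coefficient close to zero, as described above, means that age explains very little of the variation in trip duration even when the sign is consistently negative.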

    After breaking down by station:

  • Time of day: morning is not necessarily the period with the most trips. 4 stations have the most trips in the morning and the other 3 have the most trips in the afternoon. This needs more investigation.

  • Day of week: weekdays (Monday through Friday) have more trips than weekends. Compared to the other days, Sunday has fewer trips. Some stations have more weekend trips than other stations (though still fewer than on their own weekdays), possibly because they are close to tourist attractions. All of these points need a deeper look.

    After breaking down the top 7 stations by user attributes:

    Subscribers outnumber customers at each station. However, there are more customers at San Francisco Ferry Building (Harry Bridges Plaza) than at the other stations. Customers might include tourists. Male users take far more trips than female users. The gender distribution in SF does not explain why there are more male users, so this needs deeper investigation.

    After log transformation, the median age (between 30 and 40) is consistent across stations. The median duration is around 650 seconds. However, there are many high outliers above 1,500 seconds, around 4.5% of trips.

    Observed relationships in bivariate exploration

    In the top 7 stations, looking at the time and user attributes:

    1. Time:

  • After separating into 7 stations, there are more trips in the morning and afternoon than at night. The number of trips in the afternoon is slightly lower than in the morning and higher than at night.

  • It makes sense that there are more trips on weekdays and fewer on weekends because of work schedules.

    2. User:

  • Age: most users are between 30 and 40 years old, which may suggest full-time employees and commuters.

  • Gender: males take far more trips than females. This needs more investigation.

  • Subscription: subscribers take more trips than customers because of pricing and population.

    Multivariate Exploration

    The most interesting variables are the locations and times with the most trips. Now we'll study the effects and trends after adding a third variable or more.

    After separating customers from subscribers, there are some very interesting findings in these 3 categorical time variables.

  • Time of day: there are more trips in the morning or afternoon for both customers and subscribers.

  • Weekdays: customers probably include tourists, because most of their trips happen on the weekend. Subscribers, on the other hand, are probably commuters, because most of their trips happen on weekdays.

  • After checking time of day and weekdays, females take most of their trips in the morning, but it is hard to see any distinct trend between females and males. This needs deeper investigation and more information.
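The weekday comparison between user types can be tabulated with a cross-tabulation and a grouped mean. The ten rows below are made up purely to show the mechanics; the helper column `is_weekend` is an assumption.

```python
import pandas as pd

# Made-up trips: six by subscribers, four by customers.
trips = pd.DataFrame({
    "user_type": ["Subscriber"] * 6 + ["Customer"] * 4,
    "weekday": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday",
                "Saturday", "Sunday", "Sunday", "Friday"],
})

# Trip counts for every user_type / weekday combination.
counts = pd.crosstab(trips["user_type"], trips["weekday"])

# Share of each user type's trips that fall on a weekend.
trips["is_weekend"] = trips["weekday"].isin(["Saturday", "Sunday"])
weekend_share = trips.groupby("user_type")["is_weekend"].mean()

print(counts)
print(weekend_share)
```

Comparing shares rather than raw counts matters here, because subscribers take far more trips overall than customers.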

    For the top 7 stations by time (time of day, weekday),
    there are 3 categorical variables and 1 numeric variable, which makes too many groups, so they are separated into panels using a FacetGrid.
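A faceted plot of this kind might look like the following sketch, assuming seaborn. The data is random, only two placeholder stations are used to keep the grid small, and the choice of which variable goes on rows, columns, and the x axis is an assumption rather than the notebook's actual layout.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook

import numpy as np
import pandas as pd
import seaborn as sns

PERIODS = ["Morning", "Afternoon", "Night"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

rng = np.random.default_rng(1)
n = 600
# Random placeholder data: 3 categorical variables and 1 numeric variable.
trips = pd.DataFrame({
    "start_station_name": rng.choice(["Station A", "Station B"], size=n),
    "time_of_day": rng.choice(PERIODS, size=n),
    "weekday": rng.choice(WEEKDAYS, size=n),
    "member_age": rng.integers(18, 61, size=n),
})

# One panel per (time of day, station) pair; weekday goes on the x axis.
g = sns.FacetGrid(trips, row="time_of_day", col="start_station_name",
                  row_order=PERIODS, height=2.2, aspect=2.0)
g.map_dataframe(sns.boxplot, x="weekday", y="member_age", order=WEEKDAYS)
g.set_axis_labels("Weekday", "Member age")
g.set_titles(row_template="{row_name}", col_template="{col_name}")

print(g.axes.shape)  # (rows, columns) of the panel grid
```

Passing `order` and `row_order` explicitly keeps every panel in calendar and time-of-day order, which makes the panels directly comparable.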

    In the age distribution, there are no big differences across times and locations. Most median ages fall between 30 and 40 years old.

    After log transformation, trips are longer at night, on Saturday, and on Sunday. Let's test the effect of user type on trip duration.

    After separating subscribers from customers, the median trip duration for customers is between 600 and 1,000 seconds. Trips in the morning are longer.

    After separating subscribers from customers, the median trip duration for customers is between 600 and 1,000 seconds. Trips in the afternoon are longer.

    Features that strengthen each other when looking at locations and times

    Separating the user types, customers and subscribers, reveals more about location and time. Customers might be tourists who like to use a bike during the weekend. Also, the number of trips increases at tourist attractions like the Ferry Building and Embarcadero (close to the piers). Subscribers, on the other hand, might be commuters: their trips increase on weekdays and in the afternoon.